[Feat] Adds LongCat-AudioDiT pipeline #13390

RuixiangMa wants to merge 14 commits into huggingface:main
Conversation
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Force-pushed from 9c4613f to d2a2621
src/diffusers/models/autoencoders/autoencoder_longcat_audio_dit.py — outdated review threads, resolved.
```python
def _pixel_shuffle_1d(hidden_states: torch.Tensor, factor: int) -> torch.Tensor:
```

Similarly, I think we should inline `_pixel_shuffle_1d` in `UpsampleShortcut`, following #13390 (comment).
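For context, a 1D pixel shuffle rearranges channel groups into the length dimension. A minimal standalone sketch, assuming the conventional channels-to-length layout (the actual helper in this PR may differ in detail):

```python
import torch

def pixel_shuffle_1d(hidden_states: torch.Tensor, factor: int) -> torch.Tensor:
    # Rearrange (batch, channels, length) -> (batch, channels // factor, length * factor)
    # by interleaving `factor` consecutive channels along the length axis.
    batch, channels, length = hidden_states.shape
    hidden_states = hidden_states.view(batch, channels // factor, factor, length)
    hidden_states = hidden_states.permute(0, 1, 3, 2)  # (batch, C // factor, length, factor)
    return hidden_states.reshape(batch, channels // factor, length * factor)

x = torch.arange(12.0).reshape(1, 4, 3)
y = pixel_shuffle_1d(x, 2)
print(tuple(y.shape))      # (1, 2, 6)
print(y[0, 0].tolist())    # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

Inlining this rearrangement in `UpsampleShortcut`, as suggested, removes the free function without changing behavior.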
src/diffusers/models/autoencoders/autoencoder_longcat_audio_dit.py — several more outdated review threads, all resolved.
src/diffusers/models/transformers/transformer_longcat_audio_dit.py — several outdated review threads, all resolved.
```python
self.time_embed = AudioDiTTimestepEmbedding(dim)
self.input_embed = AudioDiTEmbedder(latent_dim, dim)
self.text_embed = AudioDiTEmbedder(dit_text_dim, dim)
self.rotary_embed = AudioDiTRotaryEmbedding(dim_head, 2048, base=100000.0)
self.blocks = nn.ModuleList(
```

See #13390 (comment).
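For context, `AudioDiTRotaryEmbedding(dim_head, 2048, base=100000.0)` suggests a standard rotary position embedding with a 2048-position table and base 100000. A hedged sketch of the usual frequency-table construction (the actual class in this PR may differ):

```python
import torch

def rotary_frequencies(dim_head: int, max_positions: int, base: float = 100000.0):
    # Standard RoPE table: one inverse frequency per even channel index,
    # one rotation angle per (position, frequency) pair, cached as cos/sin.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim_head, 2).float() / dim_head))
    angles = torch.outer(torch.arange(max_positions).float(), inv_freq)
    return angles.cos(), angles.sin()  # each (max_positions, dim_head // 2)

cos, sin = rotary_frequencies(dim_head=8, max_positions=2048, base=100000.0)
print(tuple(cos.shape))  # (2048, 4)
```

At position 0 every angle is zero, so `cos[0]` is all ones and `sin[0]` all zeros, which is a quick sanity check on any RoPE implementation.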
src/diffusers/models/transformers/transformer_longcat_audio_dit.py — outdated review thread, resolved.
```python
batch_size = hidden_states.shape[0]
if timestep.ndim == 0:
    timestep = timestep.repeat(batch_size)
timestep_embed = self.time_embed(timestep)
text_mask = encoder_attention_mask.bool()
encoder_hidden_states = self.text_embed(encoder_hidden_states, text_mask)
```

Can you also refactor `forward` here so that it is better organized, following #13390 (comment)? See for example the `QwenImageTransformer2DModel.forward` method.
Reorganized parts of forward incrementally; kept the current structure otherwise to avoid unnecessary behavioral churn.
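To illustrate the organization being asked for (compute all embeddings up front, then run the blocks, then project out), here is a hypothetical miniature; it is not the real LongCat model, and every layer name, shape, and stand-in computation below is made up:

```python
import torch
from torch import nn

class TinyAudioDiT(nn.Module):
    # Toy model showing the suggested forward structure, not the PR's actual code.
    def __init__(self, latent_dim=8, dim=16, text_dim=12, num_blocks=2):
        super().__init__()
        self.input_embed = nn.Linear(latent_dim, dim)
        self.text_embed = nn.Linear(text_dim, dim)
        self.time_embed = nn.Linear(1, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))
        self.proj_out = nn.Linear(dim, latent_dim)

    def forward(self, hidden_states, timestep, encoder_hidden_states, encoder_attention_mask):
        batch_size = hidden_states.shape[0]
        # 1. Broadcast scalar timesteps and prepare all embeddings before the blocks.
        if timestep.ndim == 0:
            timestep = timestep.repeat(batch_size)
        temb = self.time_embed(timestep[:, None].float())
        hidden_states = self.input_embed(hidden_states)
        text_mask = encoder_attention_mask.bool()
        encoder_hidden_states = self.text_embed(encoder_hidden_states) * text_mask[..., None]
        # 2. Transformer blocks (stand-in linears here) consume the prepared tensors.
        hidden_states = hidden_states + temb[:, None, :]
        for block in self.blocks:
            hidden_states = block(hidden_states) + encoder_hidden_states.mean(1, keepdim=True)
        # 3. Final projection back to the latent space.
        return self.proj_out(hidden_states)

model = TinyAudioDiT()
out = model(torch.randn(2, 5, 8), torch.tensor(3), torch.randn(2, 4, 12), torch.ones(2, 4))
print(tuple(out.shape))  # (2, 5, 8)
```

Grouping the preparation, the block loop, and the output projection into labeled sections is what makes the Qwen-style forward easier to follow.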
src/diffusers/models/transformers/transformer_longcat_audio_dit.py — outdated review threads, resolved.
```python
class TestLongCatAudioDiTTransformerMemory(LongCatAudioDiTTransformerTesterConfig, MemoryTesterMixin):
    def test_layerwise_casting_memory(self):
        pytest.skip("LongCatAudioDiTTransformer does not support standard layerwise casting memory tests yet.")

    def test_layerwise_casting_training(self):
        pytest.skip("LongCatAudioDiTTransformer does not support standard layerwise casting training tests yet.")

    def test_group_offloading_with_layerwise_casting(self, *args, **kwargs):
        pytest.skip(
            "LongCatAudioDiTTransformer does not support combined group offloading and layerwise casting tests yet."
        )
```

Suggested change:

```python
class TestLongCatAudioDiTTransformerMemory(LongCatAudioDiTTransformerTesterConfig, MemoryTesterMixin):
    pass
```

Layerwise casting should work if #13390 (comment) is applied.

I removed the layerwise-casting training and combined group-offloading/layerwise-casting skips after updating the dtype handling. I kept test_layerwise_casting_memory skipped because the tiny transformer config does not provide stable peak-memory behavior for that assertion.
tests/models/transformers/test_models_transformer_longcat_audio_dit.py — outdated review thread, resolved.
dg845 left a comment:
Thanks for your continued work on this! Left some suggestions that should help LongCatAudioDiTPipeline support model offloading, layerwise casting, etc.
@bot /style

Style bot fixed some files and pushed the changes.
```python
@classmethod
@validate_hf_hub_args
def from_pretrained(
```
Can you add a conversion script? Our pipelines should not define a `from_pretrained` method.

Added it and tested.
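The core of such a conversion script is usually a key remap from the original checkpoint's naming to diffusers naming, after which the converted weights are loaded into the diffusers modules and saved with `save_pretrained`. A sketch of the remapping step, with purely illustrative prefixes (the real LongCat key names are not shown in this thread):

```python
# Hypothetical prefix map; the actual LongCat checkpoint keys will differ.
RENAME_MAP = {
    "denoiser.": "transformer.",
    "vae_model.": "vae.",
}

def convert_state_dict(state_dict: dict) -> dict:
    # Rewrite each key's prefix according to RENAME_MAP, leaving values untouched.
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in RENAME_MAP.items():
            if new_key.startswith(old_prefix):
                new_key = new_prefix + new_key[len(old_prefix):]
        converted[new_key] = value
    return converted

print(convert_state_dict({"denoiser.blocks.0.weight": 1.0}))
# {'transformer.blocks.0.weight': 1.0}
```

Keeping the map in one dictionary makes it easy to audit against the original checkpoint and extend when layer names change.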
@claude can you help with a review here?

I'll analyze this and get back to you.
```python
timesteps = self.scheduler.timesteps
self._num_timesteps = len(timesteps)

for i, t in enumerate(timesteps):
```

Can you add support for a progress bar here? For example, here is how Flux 2 implements a progress bar with `self.progress_bar`: src/diffusers/pipelines/flux2/pipeline_flux2.py, lines 955 to 956 (at 5063aa5). This will make it easier to track progress during inference.

Done:
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:03<00:00, 13.68it/s]
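The diffusers pattern is to wrap the denoising loop in `with self.progress_bar(total=num_inference_steps) as progress_bar:` and call `progress_bar.update()` each step; the context manager is inherited from `DiffusionPipeline` and is tqdm-backed. A standalone sketch that mimics the shape of that pattern with a stand-in pipeline class (everything here is illustrative, not the real pipeline):

```python
from contextlib import contextmanager

class MiniPipeline:
    # Hypothetical stand-in: real pipelines get a tqdm-backed `progress_bar`
    # context manager from diffusers' DiffusionPipeline base class.
    @contextmanager
    def progress_bar(self, total):
        state = {"done": 0, "total": total}

        class _Bar:
            def update(self, n=1):
                state["done"] += n
                print(f"\rstep {state['done']}/{state['total']}", end="")

        yield _Bar()
        print()

    def denoise(self, timesteps):
        completed = 0
        with self.progress_bar(total=len(timesteps)) as bar:
            for i, t in enumerate(timesteps):
                # model forward + scheduler.step(...) would go here
                completed += 1
                bar.update()
        return completed

print(MiniPipeline().denoise(list(range(50))))  # prints progress, then 50
```

Driving the bar from inside the loop keeps the step count in lockstep with the scheduler, matching what the tqdm trace above shows.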

What does this PR do?
Adds LongCat-AudioDiT model support to diffusers.
Although LongCat-AudioDiT can be used for TTS-like generation, it is fundamentally a diffusion-based audio generation model (text conditioning, iterative latent denoising, and VAE decoding) rather than a conventional autoregressive TTS model, so I think it fits naturally into diffusers.
Test
Result
longcat.wav
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.